Discrete distributions


import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
from rlxutils import subplots
import sys
import init
%matplotlib inline

Discrete (or categorical) distributions

We work with a joint distribution of two variables; in the example below these are \(edad\) (age) and \(barrio\) (neighborhood).

Each variable may take any value from a FINITE set. Observe that \(edad\) is discrete because its values come binned into age groups.
The possible values of a variable may or may not be sortable in a meaningful way. \(edad\) is sortable; \(barrio\) is not, because alphabetical order does not imply any relation between neighborhoods.
For instance, \(edad\) 10-14 < 25-29 represents a true relation in the data (younger/older people),
but \(barrio\) Aranjuez < Belén does NOT represent any true relation between the two neighborhoods; it is arbitrary.
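
The distinction between sortable and non-sortable values can be made explicit in code. The following is a minimal sketch, assuming pandas `Categorical` is used to hold the labels; the sample values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample of labels, for illustration only
edad = pd.Categorical(
    ["10-14", "5-9", "25-29"],
    categories=["5-9", "10-14", "25-29"],  # explicit, meaningful order
    ordered=True,
)
barrio = pd.Categorical(["Belén", "Aranjuez", "Belén"])  # no order declared

# Order-based operations are defined for edad ...
youngest = edad.min()          # smallest category according to the declared order
is_under_25 = edad < "25-29"   # elementwise comparison by category order

# ... but pandas refuses them for the unordered barrio
try:
    barrio.min()
    barrio_is_sortable = True
except TypeError:
    barrio_is_sortable = False

print(youngest, list(is_under_25), barrio_is_sortable)
```

Note that the declared order places 5-9 before 10-14, which a plain alphabetical sort of the strings would not do.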

Recall that:

the joint probability is the probability that a given value of \(edad\) and a given value of \(barrio\) occur simultaneously. It answers the question: what proportion of all observed people are aged 10-14 and live in Belén?
the marginal probability is the probability of a value of one variable irrespective of the value of the other variable. It answers the question: what is the observed proportion of people living in Belén?
the conditional probability is the probability of a value of one variable given that the other variable takes a specific value. It answers the question: if we only consider people living in Belén, what is the observed proportion of people aged 10-14?
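
The three definitions can be checked on a tiny table. This is a sketch on hypothetical counts (two made-up barrios `A` and `B`, two made-up age groups), not on the Medellín data.

```python
import pandas as pd

# Hypothetical counts: rows are barrios, columns are age groups
counts = pd.DataFrame(
    [[10, 30], [20, 40]],
    index=["A", "B"], columns=["young", "old"],
)
total = counts.values.sum()  # 100 people in all

# joint: young AND living in B, out of everybody
joint = counts.loc["B", "young"] / total
# marginal: living in B, whatever the age
marginal = counts.loc["B"].sum() / total
# conditional: young, counting only people living in B
conditional = counts.loc["B", "young"] / counts.loc["B"].sum()

print(joint, marginal, conditional)
```

In this toy table the conditional equals the joint divided by the marginal, which is the relation used later when normalizing.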

The data below are the 2017 population projections for Medellín, giving the number of people per age group and district of residence, taken from medata.gov.co

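# load the semicolon-separated file
x = pd.read_csv("local/data/proyecciones_de_poblacion_medellin_2017.csv.gz", delimiter=";")
# normalize the age-group labels and rename the columns of interest
x['grupo_edad'] = x.grupo_edad.str.strip().str.lower()
x = x.rename({"codigo": "barrio", "grupo_edad": "edad"}, axis=1)
# zero-pad labels so that age groups sort correctly as strings
x = x.replace("0-4", '00-04').replace('5-9', '05-09').replace('80 y más', '80-')
# drop aggregate rows (totals and subtotals)
x = x[x.edad.str.lower().str.strip()!="total"]
x = x[[("Suma" not in i)&("Total" not in i) for i in x.barrio]]
# build a barrio x edad table of counts
x = x.groupby(["edad", "barrio" ])[['total_2017']].sum().unstack().T.loc['total_2017']
x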

edad	00-04	05-09	10-14	15-19	20-24	25-29	30-34	35-39	40-44	45-49	50-54	55-59	60-64	65-69	70-74	75-79	80-
barrio
Altavista	3281	3183	3538	3467	3888	3861	3270	2990	3270	2689	1896	1107	717	520	421	367	109
Aranjuez	10047	10206	10357	10976	11936	13804	13695	11557	9851	11339	12631	11616	8990	6352	4103	2551	2904
Belén	8406	9304	9695	12281	14230	15365	15848	14129	11069	13116	16739	16754	13710	10397	6893	4369	5094
Buenos Aires	6968	7288	7212	8611	10025	11343	11011	9835	8276	9630	11503	11146	8586	6071	3991	2713	3046
Castilla	7844	7986	8228	8743	10642	12446	11560	10021	9083	12129	14627	12250	9130	6665	4253	2652	2622
Doce de Octubre	12714	12562	12473	13301	14749	15935	14512	12670	11435	14508	16409	14151	10555	7568	4926	3150	3169
El Poblado	3666	4212	4527	5353	6363	8189	9540	9600	8681	10461	13356	13622	11246	8634	5739	4086	4211
Guayabal	4175	4428	4523	5268	6217	7460	7620	6692	5463	6768	8231	7821	6743	5364	3552	2546	2526
La América	2439	2883	3034	4091	4923	6154	6775	6377	5143	6418	9502	10146	8779	7678	5315	3393	3868
La Candelaria	3111	3453	3715	4038	4557	6305	7479	6949	5157	5944	7206	7179	6452	5097	3330	2382	3304
Laureles - Estadio	2906	3533	3910	4850	5523	7946	9810	9010	6861	7589	10124	11632	11062	9895	6966	5262	5865
Manrique	11440	11314	11312	11756	12422	13701	12536	10458	9325	11208	12622	11293	8057	5460	3568	2238	2360
Palmitas	474	417	543	609	666	646	505	430	634	611	507	361	233	198	109	87	31
Popular	11799	11707	11113	11056	10779	10956	10037	8971	7859	8280	8382	6789	4860	3426	2228	1589	1614
Robledo	10897	11153	11142	12065	13989	14847	13997	12195	10244	11809	13580	12245	9498	6923	4467	2596	2759
San Antonio	8610	8208	9796	9925	11907	11737	10736	9990	11105	8874	6166	4119	2517	1601	1102	875	326
San Cristóbal	6985	6898	8400	8337	9241	9375	8014	7614	8303	6913	4641	3257	1901	1243	887	780	283
San Javier	9775	10642	10930	11672	11885	12005	11317	10307	8614	8838	9205	7705	5679	4049	2620	1693	2239
Santa Cruz	8789	8720	8546	8891	9289	9814	9039	7867	6491	7037	7978	6554	4581	3424	2257	1471	1766
Santa Elena	1559	1491	1725	1670	1785	1997	1757	1495	1514	1488	1077	678	511	353	250	159	50
Villa Hermosa	9971	10346	10228	11280	11901	12581	11909	10075	7885	7939	8694	7737	6052	4552	2970	1952	2470

joint distribution

We turn the table of counts into a joint distribution by dividing each cell by the total number of people. This is an empirical distribution, because the data was obtained by counting, using some method, in the real world, and not derived or assumed through some analytical procedure or calculation.

xd = x/x.values.sum()
xd

edad	00-04	05-09	10-14	15-19	20-24	25-29	30-34	35-39	40-44	45-49	50-54	55-59	60-64	65-69	70-74	75-79	80-
barrio
Altavista	0.001308	0.001269	0.001410	0.001382	0.001550	0.001539	0.001304	0.001192	0.001304	0.001072	0.000756	0.000441	0.000286	0.000207	0.000168	0.000146	0.000043
Aranjuez	0.004005	0.004069	0.004129	0.004376	0.004758	0.005503	0.005460	0.004607	0.003927	0.004520	0.005035	0.004631	0.003584	0.002532	0.001636	0.001017	0.001158
Belén	0.003351	0.003709	0.003865	0.004896	0.005673	0.006125	0.006318	0.005633	0.004413	0.005229	0.006673	0.006679	0.005466	0.004145	0.002748	0.001742	0.002031
Buenos Aires	0.002778	0.002905	0.002875	0.003433	0.003996	0.004522	0.004390	0.003921	0.003299	0.003839	0.004586	0.004443	0.003423	0.002420	0.001591	0.001082	0.001214
Castilla	0.003127	0.003184	0.003280	0.003485	0.004242	0.004962	0.004608	0.003995	0.003621	0.004835	0.005831	0.004883	0.003640	0.002657	0.001695	0.001057	0.001045
Doce de Octubre	0.005068	0.005008	0.004972	0.005302	0.005880	0.006353	0.005785	0.005051	0.004559	0.005784	0.006541	0.005641	0.004208	0.003017	0.001964	0.001256	0.001263
El Poblado	0.001461	0.001679	0.001805	0.002134	0.002537	0.003265	0.003803	0.003827	0.003461	0.004170	0.005324	0.005430	0.004483	0.003442	0.002288	0.001629	0.001679
Guayabal	0.001664	0.001765	0.001803	0.002100	0.002478	0.002974	0.003038	0.002668	0.002178	0.002698	0.003281	0.003118	0.002688	0.002138	0.001416	0.001015	0.001007
La América	0.000972	0.001149	0.001210	0.001631	0.001963	0.002453	0.002701	0.002542	0.002050	0.002559	0.003788	0.004045	0.003500	0.003061	0.002119	0.001353	0.001542
La Candelaria	0.001240	0.001377	0.001481	0.001610	0.001817	0.002514	0.002982	0.002770	0.002056	0.002370	0.002873	0.002862	0.002572	0.002032	0.001328	0.000950	0.001317
Laureles - Estadio	0.001158	0.001408	0.001559	0.001933	0.002202	0.003168	0.003911	0.003592	0.002735	0.003025	0.004036	0.004637	0.004410	0.003945	0.002777	0.002098	0.002338
Manrique	0.004561	0.004510	0.004510	0.004687	0.004952	0.005462	0.004998	0.004169	0.003717	0.004468	0.005032	0.004502	0.003212	0.002177	0.001422	0.000892	0.000941
Palmitas	0.000189	0.000166	0.000216	0.000243	0.000266	0.000258	0.000201	0.000171	0.000253	0.000244	0.000202	0.000144	0.000093	0.000079	0.000043	0.000035	0.000012
Popular	0.004704	0.004667	0.004430	0.004407	0.004297	0.004368	0.004001	0.003576	0.003133	0.003301	0.003342	0.002706	0.001937	0.001366	0.000888	0.000633	0.000643
Robledo	0.004344	0.004446	0.004442	0.004810	0.005577	0.005919	0.005580	0.004862	0.004084	0.004708	0.005414	0.004881	0.003786	0.002760	0.001781	0.001035	0.001100
San Antonio	0.003432	0.003272	0.003905	0.003957	0.004747	0.004679	0.004280	0.003983	0.004427	0.003538	0.002458	0.001642	0.001003	0.000638	0.000439	0.000349	0.000130
San Cristóbal	0.002785	0.002750	0.003349	0.003324	0.003684	0.003737	0.003195	0.003035	0.003310	0.002756	0.001850	0.001298	0.000758	0.000496	0.000354	0.000311	0.000113
San Javier	0.003897	0.004242	0.004357	0.004653	0.004738	0.004786	0.004512	0.004109	0.003434	0.003523	0.003670	0.003072	0.002264	0.001614	0.001044	0.000675	0.000893
Santa Cruz	0.003504	0.003476	0.003407	0.003544	0.003703	0.003912	0.003603	0.003136	0.002588	0.002805	0.003180	0.002613	0.001826	0.001365	0.000900	0.000586	0.000704
Santa Elena	0.000621	0.000594	0.000688	0.000666	0.000712	0.000796	0.000700	0.000596	0.000604	0.000593	0.000429	0.000270	0.000204	0.000141	0.000100	0.000063	0.000020
Villa Hermosa	0.003975	0.004124	0.004077	0.004497	0.004744	0.005015	0.004748	0.004016	0.003143	0.003165	0.003466	0.003084	0.002413	0.001815	0.001184	0.000778	0.000985

# it must add up to 1
xd.values.sum()

1.0
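
The table above starts from pre-aggregated counts. When the data comes instead as one record per individual, the same kind of joint distribution can be built directly. The sketch below assumes `pd.crosstab` with `normalize="all"` and uses hypothetical records.

```python
import pandas as pd

# Hypothetical individual records (one row per person), for illustration only
people = pd.DataFrame({
    "barrio": ["A", "A", "B", "B", "B"],
    "edad":   ["young", "old", "young", "young", "old"],
})

# normalize="all" divides every cell count by the total number of records
joint = pd.crosstab(people["barrio"], people["edad"], normalize="all")
print(joint)
```

Because of floating-point rounding, the sum of a normalized table may differ from 1 in the last digits, so a tolerance-based check such as `np.isclose` is safer than an exact comparison.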

marginal distribution

These are the TWO marginal distributions, one for each variable. Each is obtained by summing the joint distribution over the other variable.

dbarrio = xd.sum(axis=1)
dedad   = xd.sum(axis=0)

dbarrio.sum(), dedad.sum()

(1.0, 1.0)
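
In symbols, the two sums above correspond to the following, where \(e\) ranges over the age groups and \(b\) over the barrios (index names introduced here only for this note):

\[P(\text{barrio}=b) = \sum_{e} P(\text{edad}=e, \text{barrio}=b) \qquad P(\text{edad}=e) = \sum_{b} P(\text{edad}=e, \text{barrio}=b)\]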

# one panel per marginal, with the same y-range so that bars are comparable
for ax,i in subplots(2, usizex=5, usizey=3.5):
    if i==0: dbarrio.plot(kind="bar")
    if i==1: dedad.plot(kind="bar")
    plt.grid()
    plt.ylim(0,0.1)
plt.tight_layout()

[Figure: bar charts of the two marginal distributions, \(barrio\) and \(edad\)]

conditional distribution

We compute it for one variable, given a specific value of the other one.

For example, the distribution of \(edad\) among people living in Belén is

\[P(\text{edad}|\text{barrio}=\text{belen})\]

Observe that we obtain it from the joint distribution, but WE MUST NORMALIZE so that we have a true distribution adding up to 1.

This normalization will become very important later on in the course.
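
Written out, the normalization is a division by the marginal probability of the conditioning value. Here \(e\) and \(e'\) range over the age groups and \(b\) is the chosen barrio (symbols introduced only for this note):

\[P(\text{edad}=e|\text{barrio}=b) = \frac{P(\text{edad}=e, \text{barrio}=b)}{P(\text{barrio}=b)} = \frac{P(\text{edad}=e, \text{barrio}=b)}{\sum_{e'} P(\text{edad}=e', \text{barrio}=b)}\]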

# unnormalized conditional

xd.loc['Belén']

edad
00-04    0.003351
05-09    0.003709
10-14    0.003865
15-19    0.004896
20-24    0.005673
25-29    0.006125
30-34    0.006318
35-39    0.005633
40-44    0.004413
45-49    0.005229
50-54    0.006673
55-59    0.006679
60-64    0.005466
65-69    0.004145
70-74    0.002748
75-79    0.001742
80-      0.002031
Name: Belén, dtype: float64

# it does not add up to one
xd.loc['Belén'].sum()

0.07869355283657012

# we normalize it

dbelen = xd.loc['Belén'] / xd.loc['Belén'].sum()
print("check sum =", dbelen.sum())
dbelen

check sum = 1.0

edad
00-04    0.042584
05-09    0.047133
10-14    0.049114
15-19    0.062214
20-24    0.072087
25-29    0.077837
30-34    0.080284
35-39    0.071576
40-44    0.056074
45-49    0.066444
50-54    0.084798
55-59    0.084874
60-64    0.069453
65-69    0.052670
70-74    0.034919
75-79    0.022133
80-      0.025806
Name: Belén, dtype: float64

dbelen.plot(kind='bar', figsize=(6,2))
plt.grid();

[Figure: bar chart of the conditional distribution of \(edad\) given \(barrio\) = Belén]

Sometimes we write

\[P(\text{edad}|\text{barrio})\]

without specifying the value of the conditioning variable, on the assumption that a certain value has already been fixed. Pay attention to the context in which this notation is used to understand how to compute or use the conditional distribution.

In fact, for each value of \(barrio\) we have a different distribution.
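
All conditional distributions can be computed in one step rather than one value at a time. This is a minimal sketch on a hypothetical joint table, assuming the same layout as in this section (rows for \(barrio\), columns for \(edad\)) and using `DataFrame.div` to divide along a chosen axis.

```python
import pandas as pd

# Hypothetical joint distribution (rows: barrio, columns: edad)
joint = pd.DataFrame(
    [[0.1, 0.3], [0.2, 0.4]],
    index=["A", "B"], columns=["young", "old"],
)

# P(edad | barrio): divide each row by its row sum, i.e. by P(barrio)
p_edad_given_barrio = joint.div(joint.sum(axis=1), axis=0)

# P(barrio | edad): divide each column by its column sum, i.e. by P(edad)
p_barrio_given_edad = joint.div(joint.sum(axis=0), axis=1)

print(p_edad_given_barrio)
print(p_barrio_given_edad)
```

Each row of the first table, and each column of the second, is a distribution in its own right and adds up to 1.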

independence

Observe carefully: if all conditional distributions look the same, this suggests that the two variables are independent, that is, knowing something about one does not tell us anything about the other.

for ax,barrio in subplots(xd.index, usizex=5, usizey=3, n_cols=4):
    # conditional distribution of edad given this barrio
    dconditional = xd.loc[barrio] / xd.loc[barrio].sum()
    dconditional.plot(kind='bar', ax=ax)
    plt.title(barrio)
    plt.ylim(0,.11)
    plt.grid();
plt.tight_layout()

[Figure: one bar chart per barrio, showing the conditional distribution of \(edad\) given that barrio]

Or, conditioning instead on each value of \(edad\), the distribution over \(barrio\):

for ax,edad in subplots(xd.columns, usizex=5, usizey=3, n_cols=4):
    # conditional distribution of barrio given this age group
    dconditional = xd[edad] / xd[edad].sum()
    dconditional.plot(kind='bar', ax=ax)
    plt.title(edad)
    plt.ylim(0,.11)
    plt.grid();
plt.tight_layout()

[Figure: one bar chart per age group, showing the conditional distribution of \(barrio\) given that age group]
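
Comparing conditional distributions by eye can be complemented with a numeric check. Two variables are independent exactly when the joint equals the product of the marginals, which is equivalent to every conditional being equal to the corresponding marginal. The sketch below measures how far a joint table is from that product; the function name `independence_gap` and both example tables are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def independence_gap(joint: pd.DataFrame) -> float:
    """Largest absolute difference between the joint and the product of its marginals."""
    p_row = joint.sum(axis=1).values   # marginal of the row variable
    p_col = joint.sum(axis=0).values   # marginal of the column variable
    product = np.outer(p_row, p_col)   # what the joint would be under independence
    return float(np.abs(joint.values - product).max())

# Hypothetical joint tables, for illustration only
independent = pd.DataFrame(np.outer([0.4, 0.6], [0.3, 0.7]))
dependent = pd.DataFrame([[0.4, 0.1], [0.1, 0.4]])

print(independence_gap(independent))
print(independence_gap(dependent))
```

A gap near zero is consistent with independence. With empirical counts the gap is rarely exactly zero, so deciding whether a small gap is meaningful requires a statistical test, which is beyond this sketch.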